Customer Segmentation and Lifetime Value Prediction in Telecom Industry

Problem Statement:¶

The telecom industry is highly competitive, and understanding customer behavior is crucial for reducing churn and maximizing customer lifetime value (LTV). This project aims to analyze customer data to perform segmentation based on usage patterns and demographic information. Additionally, the project will calculate and predict the lifetime value of customers, enabling the telecom company to identify high-value customers and develop targeted retention strategies.

Objective for the TelecomChurn Project¶

The primary objective of the TelecomChurn project is to enhance customer retention and maximize lifetime value (LTV) by leveraging advanced customer segmentation and predictive analytics. This will be achieved through the following key goals:

  1. Customer Segmentation:

    • Purpose: Categorize telecom customers into distinct segments based on demographic and usage attributes to better understand customer profiles and tailor marketing and retention strategies.
    • Method:
      • Utilize attributes such as age, gender, location, salary, dependents, calls, SMS, and data usage.
      • Implement RFM (Recency, Frequency, Monetary value) analysis to gauge customer engagement levels.
      • Apply clustering techniques like K-means to identify unique customer segments.
  2. Calculate and Predict Customer Lifetime Value (LTV):

    • Purpose: Determine and forecast the net profit attributed to the entire future relationship with each customer, crucial for understanding profitability and retention priorities.
    • Method:
      • Use SQL queries to aggregate usage data and calculate LTV.
      • Develop predictive models based on historical data and customer segments to estimate future LTV.
  3. Targeted Retention Strategies:

    • Purpose: Identify high-value customers and prioritize retention efforts to reduce churn and optimize revenue.
    • Method:
      • Segment customers by their behavior patterns and LTV to create targeted marketing campaigns.
      • Use LTV data in conjunction with churn indicators to build models that predict churn risk and enable proactive retention initiatives.
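The SQL step in objective 2 can be sketched with an in-memory SQLite table. The column names below mirror the dataset, but the sample rows and the LTV formula (a salary-weighted data-usage proxy) are illustrative assumptions only, not the project's final definition:

```python
import sqlite3
import pandas as pd

# Toy rows using the dataset's column names (values are made up)
df = pd.DataFrame({
    "customer_id": [1, 2],
    "estimated_salary": [124962, 130556],
    "calls_made": [44, 62],
    "sms_sent": [45, 39],
    "data_used": [361, 5973],
})

conn = sqlite3.connect(":memory:")
df.to_sql("telecom", conn, index=False)

# Aggregate usage and compute an illustrative per-customer LTV proxy
query = """
SELECT customer_id,
       calls_made + sms_sent AS interactions,
       estimated_salary * 1.0 * data_used / 1000 AS ltv_proxy
FROM telecom
ORDER BY ltv_proxy DESC
"""
ltv = pd.read_sql_query(query, conn)
conn.close()
print(ltv)
```

The same pattern scales to the full dataset: load the churn CSV into a table, then rank customers by the aggregated proxy.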

By implementing these strategies, the TelecomChurn project aims to provide actionable insights into customer behavior, enhance customer engagement, and significantly improve the telecom company's competitive edge in the market.

So, join us as we embark on this data-driven journey, exploring the behaviors and trends within the telecom industry. We aim to uncover valuable insights into customer segmentation and predict customer lifetime value, providing a deeper understanding of customer retention and profitability.

Project Contributors

1. Daniel McDinna  2. Oluwakayode Onakoya  3. Ebunoluwa Adebukola  4. Anuoluwa Ojebode  5. Adekeye AbdulRasheed  
In [29]:
# Let's begin!
# Import the necessary libraries for our analysis and insight exploration on the dataset

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
from math import pi
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm
import sqlite3
import math
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
In [2]:
# Read in the dataset to our notebook

telecom_data = pd.read_csv(r"C:\Users\USER PC\OneDrive\Documents\Business and Education\INGRYD\Project\Capstone Project\Telecom_Churn.csv")
telecom_data
Out[2]:
customer_id telecom_partner gender age state city pincode date_of_registration num_dependents estimated_salary calls_made sms_sent data_used churn
0 1 Reliance Jio F 25 Karnataka Kolkata 755597 2020-01-01 4 124962 44 45 -361 0
1 2 Reliance Jio F 55 Mizoram Mumbai 125926 2020-01-01 2 130556 62 39 5973 0
2 3 Vodafone F 57 Arunachal Pradesh Delhi 423976 2020-01-01 0 148828 49 24 193 1
3 4 BSNL M 46 Tamil Nadu Kolkata 522841 2020-01-01 1 38722 80 25 9377 1
4 5 BSNL F 26 Tripura Delhi 740247 2020-01-01 2 55098 78 15 1393 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
243548 243549 Airtel F 28 Mizoram Kolkata 110295 2023-05-03 3 130580 28 9 4102 0
243549 243550 Reliance Jio F 52 Assam Kolkata 713481 2023-05-03 0 82393 80 45 7521 0
243550 243551 Reliance Jio M 59 Tripura Kolkata 520218 2023-05-03 4 51298 26 4 6547 0
243551 243552 BSNL M 49 Madhya Pradesh Kolkata 387744 2023-05-03 2 83981 80 15 1125 0
243552 243553 BSNL F 37 Telangana Hyderabad 139086 2023-05-04 0 144297 61 7 3384 0

243553 rows × 14 columns

Data Requirements¶

Dataset: The provided telecom churn dataset includes the following attributes:¶
  • customer_id: Unique identifier for each customer.
  • telecom_partner: Partner telecom company.
  • gender: Gender of the customer.
  • age: Age of the customer.
  • state: State where the customer resides.
  • city: City where the customer resides.
  • pincode: Postal code of the customer's residence.
  • date_of_registration: Date when the customer registered.
  • num_dependents: Number of dependents the customer has.
  • estimated_salary: Estimated salary of the customer.
  • calls_made: Number of calls made by the customer.
  • sms_sent: Number of SMS sent by the customer.
  • data_used: Amount of data used by the customer.
  • churn: Indicator of whether the customer has churned.
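Since `date_of_registration` arrives as a plain string column, it can be parsed at load time with `parse_dates`, avoiding a separate `pd.to_datetime` step later. A minimal sketch on an inline sample (the column subset and values are illustrative):

```python
import io
import pandas as pd

# A two-row sample in the dataset's schema (values are illustrative)
csv = io.StringIO(
    "customer_id,telecom_partner,gender,age,date_of_registration,churn\n"
    "1,Reliance Jio,F,25,2020-01-01,0\n"
    "2,Vodafone,M,57,2023-05-04,1\n"
)

# parse_dates converts the column to datetime64 during the read
sample = pd.read_csv(csv, parse_dates=["date_of_registration"])
print(sample.dtypes["date_of_registration"])
```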
In [3]:
# We delve into the info of the Dataset
telecom_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243553 entries, 0 to 243552
Data columns (total 14 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   customer_id           243553 non-null  int64 
 1   telecom_partner       243553 non-null  object
 2   gender                243553 non-null  object
 3   age                   243553 non-null  int64 
 4   state                 243553 non-null  object
 5   city                  243553 non-null  object
 6   pincode               243553 non-null  int64 
 7   date_of_registration  243553 non-null  object
 8   num_dependents        243553 non-null  int64 
 9   estimated_salary      243553 non-null  int64 
 10  calls_made            243553 non-null  int64 
 11  sms_sent              243553 non-null  int64 
 12  data_used             243553 non-null  int64 
 13  churn                 243553 non-null  int64 
dtypes: int64(9), object(5)
memory usage: 26.0+ MB
In [4]:
# Get the full description of the Dataset
telecom_data.describe()
Out[4]:
customer_id age pincode num_dependents estimated_salary calls_made sms_sent data_used churn
count 243553.000000 243553.000000 243553.000000 243553.000000 243553.000000 243553.000000 243553.000000 243553.000000 243553.000000
mean 121777.000000 46.077609 549501.270541 1.997500 85021.137839 49.010548 23.945404 4993.186025 0.200478
std 70307.839393 16.444029 259808.860574 1.414941 37508.963233 29.453556 14.733575 2942.019547 0.400359
min 1.000000 18.000000 100006.000000 0.000000 20000.000000 -10.000000 -5.000000 -987.000000 0.000000
25% 60889.000000 32.000000 324586.000000 1.000000 52585.000000 24.000000 11.000000 2490.000000 0.000000
50% 121777.000000 46.000000 548112.000000 2.000000 84990.000000 49.000000 24.000000 4987.000000 0.000000
75% 182665.000000 60.000000 774994.000000 3.000000 117488.000000 74.000000 36.000000 7493.000000 0.000000
max 243553.000000 74.000000 999987.000000 4.000000 149999.000000 108.000000 53.000000 10991.000000 1.000000
In [5]:
# Check for null values in the dataset
telecom_data.isna().sum()
Out[5]:
customer_id             0
telecom_partner         0
gender                  0
age                     0
state                   0
city                    0
pincode                 0
date_of_registration    0
num_dependents          0
estimated_salary        0
calls_made              0
sms_sent                0
data_used               0
churn                   0
dtype: int64
In [89]:
# Select numerical columns for outlier detection
# (note: 'churn' is a 0/1 indicator whose 75th percentile is 0, so its IQR is 0
# and every churned customer ends up flagged as an "outlier")

numerical_columns = ['customer_id', 'age', 'pincode', 'num_dependents', 'estimated_salary', 'sms_sent', 'data_used', 'churn']

# Function to detect outliers using IQR method

def detect_outliers(df, cols):
    outliers = pd.DataFrame()
    for col in cols:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        outliers_col = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
        outliers = pd.concat([outliers, outliers_col], axis=0)
    return outliers


# Detect outliers in numerical columns
outliers_df = detect_outliers(telecom_data, numerical_columns)


# Display the outliers
print("Below are the outliers in the Telecom Churn dataset:")
outliers_df
Below are the outliers in the Telecom Churn dataset:
Out[89]:
customer_id telecom_partner gender age state city pincode date_of_registration num_dependents estimated_salary ... monthly_calls monthly_sms monthly_data R F M RFM_Segment RFM_Score Cluster LTV
2 3 Vodafone F 57 Arunachal Pradesh Delhi 423976 2020-01-01 0 148828 ... 0.912477 0.446927 3.594041 4 1 4 414 9 0 1.747473e+08
3 4 BSNL M 46 Tamil Nadu Kolkata 522841 2020-01-01 1 38722 ... 1.489758 0.465549 174.618250 4 4 1 441 9 1 1.620697e+09
7 8 BSNL M 46 Arunachal Pradesh Kolkata 866786 2020-01-01 4 104541 ... 1.620112 0.744879 41.806331 4 1 3 413 8 3 1.094575e+09
10 11 Airtel M 44 Uttarakhand Chennai 776250 2020-01-01 0 133288 ... 0.130354 0.130354 23.743017 4 1 4 414 9 0 7.583823e+08
16 17 Vodafone M 61 Himachal Pradesh Hyderabad 734068 2020-01-01 2 59723 ... 0.297952 0.782123 151.899441 4 4 2 442 10 2 2.165676e+09
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
243523 243524 Reliance Jio F 38 Gujarat Delhi 112512 2023-05-03 0 69163 ... 6.564885 0.763359 559.007634 1 3 2 132 6 2 5.526996e+08
243527 243528 BSNL F 50 Uttar Pradesh Mumbai 613558 2023-05-03 3 126994 ... 2.977099 0.152672 243.206107 1 2 4 124 7 0 4.414201e+08
243529 243530 Reliance Jio M 45 Punjab Kolkata 190431 2023-05-03 2 110127 ... 4.427481 1.832061 217.709924 1 2 3 123 6 3 3.480357e+08
243533 243534 Reliance Jio F 22 West Bengal Kolkata 732518 2023-05-03 2 123399 ... 5.267176 0.687023 34.198473 1 1 4 114 6 0 6.991451e+07
243535 243536 BSNL M 32 Andhra Pradesh Delhi 439538 2023-05-03 1 104748 ... 4.732824 1.145038 362.061069 1 2 3 123 6 3 5.438294e+08

48827 rows × 28 columns

In [7]:
# Outliers count
outliers_df.count()
Out[7]:
customer_id             48827
telecom_partner         48827
gender                  48827
age                     48827
state                   48827
city                    48827
pincode                 48827
date_of_registration    48827
num_dependents          48827
estimated_salary        48827
calls_made              48827
sms_sent                48827
data_used               48827
churn                   48827
dtype: int64
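The `detect_outliers` helper above concatenates the rows flagged for each column, so a row that is extreme in two columns appears twice in the result. A variant that accumulates a boolean mask instead counts each row at most once; this is a sketch of the alternative, not what was run above:

```python
import pandas as pd

def iqr_outlier_mask(df, cols, k=1.5):
    """True where a row falls outside [Q1 - k*IQR, Q3 + k*IQR] in ANY listed column."""
    mask = pd.Series(False, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask |= (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
    return mask

# Toy example: the last row is extreme in both columns but is flagged only once
toy = pd.DataFrame({"a": [1, 2, 3, 100], "b": [10, 11, 12, 500]})
mask = iqr_outlier_mask(toy, ["a", "b"])
print(toy[mask])
```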

Exploring Categorical Variables¶

In [76]:
# Distribution of 'gender'
sns.countplot(x='gender', data=telecom_data)
plt.title('Distribution of Gender')
plt.show()

# Distribution of 'state'
plt.figure(figsize=(12, 6))
sns.countplot(x='state', data=telecom_data, order=telecom_data['state'].value_counts().index)
plt.title('Distribution of State')
plt.xticks(rotation=90)
plt.show()
  • There are more male customers than female customers
  • Customer counts across the states differ only slightly

Exploring Numerical Variables¶

In [60]:
# Distribution of 'age'
sns.histplot(telecom_data['age'], bins=30, kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Customer Count')
plt.show()

# Distribution of 'calls_made'
sns.histplot(telecom_data['calls_made'], bins=30, kde=True)
plt.title('Distribution of Calls Made')
plt.xlabel('Calls Made')
plt.ylabel('Customer Count')
plt.show()

# Distribution of 'sms_sent'
sns.histplot(telecom_data['sms_sent'], bins=30, kde=True)
plt.title('Distribution of SMS Sent')
plt.xlabel('SMS Sent')
plt.ylabel('Customer Count')
plt.show()

# Distribution of 'data_used'
sns.histplot(telecom_data['data_used'], bins=30, kde=True)
plt.title('Distribution of Data Used')
plt.xlabel('Data Used')
plt.ylabel('Customer Count')
plt.show()

Customer segmentation¶

RFM Analysis¶

  • Calculate RFM Scores:

    Rank each customer based on Recency, Frequency, and Monetary value.

In [8]:
# Convert 'date_of_registration' to datetime
telecom_data['date_of_registration'] = pd.to_datetime(telecom_data['date_of_registration'])

# Use the current date as the reference point
# (note: datetime.now() makes the recency values depend on when the notebook is run)
current_date = datetime.now()

# Recency: Days since last registration
telecom_data['recency'] = (current_date - telecom_data['date_of_registration']).dt.days

# Frequency: Assuming monthly interaction frequency over the period
# Here, you might need to adjust based on real transaction data if available
telecom_data['frequency'] = telecom_data['calls_made'] + telecom_data['sms_sent'] + telecom_data['data_used']

# Monetary: estimated salary is used as a proxy for monetary value
telecom_data['monetary'] = telecom_data['estimated_salary']

# Let's break the df down into more columns for simplicity
telecom_data['tenure_days'] = (pd.Timestamp.now() - telecom_data['date_of_registration']).dt.days
telecom_data['monthly_calls'] = telecom_data['calls_made'] / (telecom_data['tenure_days'] / 30)
telecom_data['monthly_sms'] = telecom_data['sms_sent'] / (telecom_data['tenure_days'] / 30)
telecom_data['monthly_data'] = telecom_data['data_used'] / (telecom_data['tenure_days'] / 30)
#telecom_data['LTV'] = telecom_data['estimated_salary'] * (telecom_data['tenure_days'] / 30)

# Define quantiles for RFM ranking
r_labels = range(1, 5)
f_labels = range(1, 5)
m_labels = range(1, 5)

# Assign quantiles to RFM values
# Note: with ascending labels, R=4 marks the customers who registered longest ago;
# invert r_labels if a higher R score should mean a more recent registration
telecom_data['R'] = pd.qcut(telecom_data['recency'], q=4, labels=r_labels).astype(int)
telecom_data['F'] = pd.qcut(telecom_data['frequency'].rank(method='first'), q=4, labels=f_labels).astype(int)
telecom_data['M'] = pd.qcut(telecom_data['monetary'].rank(method='first'), q=4, labels=m_labels).astype(int)

# Create RFM segment and score
telecom_data['RFM_Segment'] = telecom_data['R'].astype(str) + telecom_data['F'].astype(str) + telecom_data['M'].astype(str)
telecom_data['RFM_Score'] = telecom_data[['R', 'F', 'M']].sum(axis=1)

# Display the first few rows to confirm RFM scores
telecom_data[['customer_id', 'RFM_Segment', 'RFM_Score']].head()
Out[8]:
customer_id RFM_Segment RFM_Score
0 1 414 9
1 2 434 11
2 3 414 9
3 4 441 9
4 5 412 7
In [9]:
# Display the first few rows to confirm new features
telecom_data[['customer_id', 'recency', 'frequency', 'monetary']].head()
Out[9]:
customer_id recency frequency monetary
0 1 1611 -272 124962
1 2 1611 6074 130556
2 3 1611 266 148828
3 4 1611 9482 38722
4 5 1611 1486 55098
In [10]:
# Display the first few rows to confirm new features
telecom_data[['customer_id', 'R', 'F', 'M']].head()
Out[10]:
customer_id R F M
0 1 4 1 4
1 2 4 3 4
2 3 4 1 4
3 4 4 4 1
4 5 4 1 2
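The `pd.qcut` calls above split each metric into quartiles; ranking first with `method='first'` breaks ties so the bins stay equal-sized even when many customers share a value. A minimal illustration with toy numbers (not drawn from the dataset):

```python
import pandas as pd

# Heavily tied toy "spend" values
spend = pd.Series([5, 5, 5, 5, 20, 40, 60, 80])

# Without rank(), qcut can fail or produce uneven bins on ties;
# rank(method='first') gives every value a unique position first
scores = pd.qcut(spend.rank(method="first"), q=4, labels=[1, 2, 3, 4]).astype(int)
print(scores.tolist())
```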
In [30]:
# Define a list of colors
colors = ['blue', 'orange', 'green', 'red', 'purple', 'brown', 'pink', 'gray', 'olive', 'cyan', 
          'yellow', 'black', 'magenta', 'turquoise', 'lime']

# Calculate the number of rows needed
num_columns = len(telecom_data.columns) - 1
num_rows = math.ceil(num_columns / 3)

plt.figure(figsize=(20, 5 * num_rows))

# Loop through columns and plot histograms
i = 1
for col in telecom_data.columns[:-1]:
    plt.subplot(num_rows, 3, i)
    sns.histplot(telecom_data[col], color=colors[i % len(colors)])  # Use a different color for each plot
    plt.title(col)
    i += 1

plt.tight_layout()
plt.show()
  • This gives us a visual representation of each column as we proceed.
In [11]:
# Let's create a new df for data transformation, keeping the original df unaltered
# Then apply standardization to the new df
scaler_df = pd.read_csv(r"C:\Users\USER PC\OneDrive\Documents\Business and Education\INGRYD\Project\Capstone Project\Telecom_Churn_Clone.csv")

# Standardization (zero mean, unit variance)
scaler = StandardScaler()
scaler_df[['age', 'estimated_salary', 'calls_made', 'sms_sent', 'data_used']] = scaler.fit_transform(telecom_data[['age', 'estimated_salary', 'calls_made', 'sms_sent', 'data_used']])

# Display the first few rows to confirm transformations
scaler_df.head()
Out[11]:
customer_id telecom_partner gender age state city pincode date_of_registration num_dependents estimated_salary calls_made sms_sent data_used churn
0 1 Reliance Jio F -1.281782 Karnataka Kolkata 755597 2020-01-01 4 1.064837 -0.170117 1.429025 -1.819905 0
1 2 Reliance Jio F 0.542593 Mizoram Mumbai 125926 2020-01-01 2 1.213975 0.441016 1.021791 0.333042 0
2 3 Vodafone F 0.664218 Arunachal Pradesh Delhi 423976 2020-01-01 0 1.701113 -0.000358 0.003706 -1.631599 1
3 4 BSNL M -0.004720 Tamil Nadu Kolkata 522841 2020-01-01 1 -1.234351 1.052149 0.071578 1.490073 1
4 5 BSNL F -1.220969 Tripura Delhi 740247 2020-01-01 2 -0.797761 0.984245 -0.607145 -1.223715 0
In [12]:
# confirm telecom_data is unaltered
telecom_data.head()
Out[12]:
customer_id telecom_partner gender age state city pincode date_of_registration num_dependents estimated_salary ... monetary tenure_days monthly_calls monthly_sms monthly_data R F M RFM_Segment RFM_Score
0 1 Reliance Jio F 25 Karnataka Kolkata 755597 2020-01-01 4 124962 ... 124962 1611 0.819367 0.837989 -6.722533 4 1 4 414 9
1 2 Reliance Jio F 55 Mizoram Mumbai 125926 2020-01-01 2 130556 ... 130556 1611 1.154562 0.726257 111.229050 4 3 4 434 11
2 3 Vodafone F 57 Arunachal Pradesh Delhi 423976 2020-01-01 0 148828 ... 148828 1611 0.912477 0.446927 3.594041 4 1 4 414 9
3 4 BSNL M 46 Tamil Nadu Kolkata 522841 2020-01-01 1 38722 ... 38722 1611 1.489758 0.465549 174.618250 4 4 1 441 9
4 5 BSNL F 26 Tripura Delhi 740247 2020-01-01 2 55098 ... 55098 1611 1.452514 0.279330 25.940410 4 1 2 412 7

5 rows × 26 columns

In [13]:
# Let's amend scaler_df with the R, F, and M values
scaler_df['R'] = telecom_data['R']
scaler_df['F'] = telecom_data['F']
scaler_df['M'] = telecom_data['M']
In [14]:
scaler_df.head()
Out[14]:
customer_id telecom_partner gender age state city pincode date_of_registration num_dependents estimated_salary calls_made sms_sent data_used churn R F M
0 1 Reliance Jio F -1.281782 Karnataka Kolkata 755597 2020-01-01 4 1.064837 -0.170117 1.429025 -1.819905 0 4 1 4
1 2 Reliance Jio F 0.542593 Mizoram Mumbai 125926 2020-01-01 2 1.213975 0.441016 1.021791 0.333042 0 4 3 4
2 3 Vodafone F 0.664218 Arunachal Pradesh Delhi 423976 2020-01-01 0 1.701113 -0.000358 0.003706 -1.631599 1 4 1 4
3 4 BSNL M -0.004720 Tamil Nadu Kolkata 522841 2020-01-01 1 -1.234351 1.052149 0.071578 1.490073 1 4 4 1
4 5 BSNL F -1.220969 Tripura Delhi 740247 2020-01-01 2 -0.797761 0.984245 -0.607145 -1.223715 0 4 1 2
In [22]:
# Plot histograms for Recency, Frequency, and Monetary
plt.figure(figsize=(15, 5))

# Recency
plt.subplot(1, 3, 1)
plt.hist(telecom_data['recency'], bins=20, color='blue')
plt.title('Distribution of Recency')
plt.xlabel('Days Since Last Registration')
plt.ylabel('Number of Customers')

# Frequency
plt.subplot(1, 3, 2)
plt.hist(telecom_data['frequency'], bins=20, color='red')
plt.title('Distribution of Frequency')
plt.xlabel('Number of Interactions')
plt.ylabel('Number of Customers')

# Monetary
plt.subplot(1, 3, 3)
plt.hist(telecom_data['monetary'], bins=20, color='green')
plt.title('Distribution of Monetary Value')
plt.xlabel('Estimated Salary')
plt.ylabel('Number of Customers')

plt.tight_layout()
plt.show()
In [18]:
# Compute the correlation matrix
corr_matrix = telecom_data[['recency', 'frequency', 'monetary', 'R', 'F', 'M']].corr()

# Generate a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix of RFM Features')
plt.show()
From the heatmap, we deduce that:
  • Recency, frequency, and monetary value are only weakly correlated with one another
  • Each raw metric (recency, frequency, monetary) is strongly correlated with its own quartile score (R, F, and M respectively)
In [21]:
# Box plots for Recency, Frequency, and Monetary
plt.figure(figsize=(15, 5))

# Recency
plt.subplot(1, 3, 1)
sns.boxplot(y=telecom_data['recency'], color='blue')
plt.title('Box Plot of Recency')

# Frequency
plt.subplot(1, 3, 2)
sns.boxplot(y=telecom_data['frequency'], color='red')
plt.title('Box Plot of Frequency')

# Monetary
plt.subplot(1, 3, 3)
sns.boxplot(y=telecom_data['monetary'], color='green')
plt.title('Box Plot of Monetary Value')

plt.tight_layout()
plt.show()


In [58]:
# Pair-plots for Recency, Frequency, and Monetary
sns.pairplot(telecom_data[['recency', 'frequency', 'monetary']])
plt.suptitle('Pair Plot of RFM Features', y=1.02)
plt.show()

print(f'From the chart, we deduce that: monetary, frequency and recency are all weakly correlated')
From the chart, we deduce that: monetary, frequency and recency are all weakly correlated

Churn Analysis¶

In [36]:
# We aim to get the churn counts alongside the respective telecom partners and other attributes
# Select specific columns for churned customers
churn_data = telecom_data[telecom_data['churn'] == 1][['telecom_partner', 'age', 'gender', 'estimated_salary', 'state', 'data_used']]

# Select specific columns for non-churned customers
non_churn_data = telecom_data[telecom_data['churn'] == 0][['telecom_partner', 'age', 'gender', 'estimated_salary', 'state', 'data_used']]

# Display the churn data
churn_data
Out[36]:
telecom_partner age gender estimated_salary state data_used
2 Vodafone 57 F 148828 Arunachal Pradesh 193
3 BSNL 46 M 38722 Tamil Nadu 9377
7 BSNL 46 M 104541 Arunachal Pradesh 2245
10 Airtel 44 M 133288 Uttarakhand 1275
16 Vodafone 61 M 59723 Himachal Pradesh 8157
... ... ... ... ... ... ...
243523 Reliance Jio 38 F 69163 Gujarat 7323
243527 BSNL 50 F 126994 Uttar Pradesh 3186
243529 Reliance Jio 45 M 110127 Punjab 2852
243533 Reliance Jio 22 F 123399 West Bengal 448
243535 BSNL 32 M 104748 Andhra Pradesh 4743

48827 rows × 6 columns

In [38]:
# View a list of telecom partners with churned customers
company_names= list(churn_data['telecom_partner'].value_counts().keys())
company_names
Out[38]:
['Airtel', 'Reliance Jio', 'Vodafone', 'BSNL']
In [40]:
# Plot a chart for churn distribution
# Get the value counts for each telecom partner in churn_data
company_churn_counts = churn_data['telecom_partner'].value_counts()

# Get the values and labels for the pie chart
company_churn_values = company_churn_counts.values
company_names = company_churn_counts.index

# Plot a chart for churn distribution
plt.figure(figsize=(10,8))
plt.pie(company_churn_values, labels=company_names, autopct='%.1f%%', labeldistance=1.15, wedgeprops={'linewidth': 3, 'edgecolor': 'white'})
plt.title("Churn Percentage of Telecom Partners")
plt.show()
  • From the chart, we see that all the telecom partners have approximately equal shares of churned customers.
  • All four telecom partners (Airtel, Reliance Jio, Vodafone, and BSNL) had churned customers
In [41]:
# Get the value counts for each gender in churn_data
gender_churn_counts = churn_data['gender'].value_counts()

# Get the values and labels for the pie chart
gender_churn_values = gender_churn_counts.values
gender_labels = gender_churn_counts.index

# Plot a chart for churn distribution by gender
plt.figure(figsize=(10, 8))
plt.pie(gender_churn_values, labels=gender_labels, autopct='%.1f%%', labeldistance=1.15, wedgeprops={'linewidth': 3, 'edgecolor': 'white'})
plt.title("Churn Percentage by Gender")
plt.show()
  • From the chart, we see that there were more male churned customers than female
In [43]:
# Plot histograms for age distribution
plt.figure(figsize=(14, 6))

# Churned customers
plt.subplot(1, 2, 1)
sns.histplot(churn_data['age'], kde=True, color='red', bins=30)
plt.title('Age Distribution of Churned Customers')
plt.xlabel('Age')
plt.ylabel('Frequency')

# Non-churned customers
plt.subplot(1, 2, 2)
sns.histplot(non_churn_data['age'], kde=True, color='blue', bins=30)
plt.title('Age Distribution of Non-Churned Customers')
plt.xlabel('Age')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
In [44]:
# Plot histograms for estimated salary distribution
plt.figure(figsize=(14, 6))

# Churned customers
plt.subplot(1, 2, 1)
sns.histplot(churn_data['estimated_salary'], kde=True, color='red', bins=30)
plt.title('Estimated Salary Distribution of Churned Customers')
plt.xlabel('Estimated Salary')
plt.ylabel('Frequency')

# Non-churned customers
plt.subplot(1, 2, 2)
sns.histplot(non_churn_data['estimated_salary'], kde=True, color='blue', bins=30)
plt.title('Estimated Salary Distribution of Non-Churned Customers')
plt.xlabel('Estimated Salary')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
  • From the chart, we see that churned customers tend to have lower estimated salaries than non-churned customers
In [48]:
# Get sorted value counts for data used
churn_value_data_used = churn_data['data_used'].value_counts().sort_index()
non_churn_value_data_used = non_churn_data['data_used'].value_counts().sort_index()

# Create a DataFrame for plotting
churn_df = pd.DataFrame({'data_used': churn_value_data_used.index, 'count': churn_value_data_used.values, 'churn_status': 'churn'})
non_churn_df = pd.DataFrame({'data_used': non_churn_value_data_used.index, 'count': non_churn_value_data_used.values, 'churn_status': 'non_churn'})
combined_df = pd.concat([churn_df, non_churn_df])

# Plot the scatter plot
plt.figure(figsize=(10, 5))
sns.scatterplot(data=combined_df, x='data_used', y='count', hue='churn_status')
plt.xlabel('Data Used')
plt.ylabel('Count')
plt.title('Scatter Plot of Data Usage by Churn and Non-Churn Customers')
plt.legend(title='Churn Status')
plt.show()

# Plot a clearer scatter plot of the sorted value counts
churn_data_used_counts = list(churn_data['data_used'].value_counts().sort_values())
non_churn_data_used_counts = list(non_churn_data['data_used'].value_counts().sort_values())
plt.figure(figsize=(10,5))
sns.scatterplot(data=churn_data_used_counts, label="churn")
sns.scatterplot(data=non_churn_data_used_counts, label="non churn")
plt.xlabel("Data used")
plt.title("Scatter plot of data usage by churn and non-churn customers")
Out[48]:
Text(0.5, 1.0, 'Scatter plot of data usage by churn and non-churn customers')
  • Findings: Non-churn customers used more data compared to churned customers.
In [62]:
# Plot a chart for churn distribution
# Get the value counts for each telecom partner in churn_data
churn_value_state_count = churn_data['state'].value_counts()

# Get the values and labels for the pie chart
churn_value_state_value = churn_value_state_count.values
state_names = churn_value_state_count.index


plt.figure(figsize=(20, 10))
plt.pie(churn_value_state_value,
        autopct='%.1f%%',
        labels=state_names,
        labeldistance=1.15,
        wedgeprops={'linewidth': 3, 'edgecolor': 'white'}
        )
plt.title("Churn Percentage of States")
plt.show()
  • All states share approximately the same percentage of churned customers (between 3.4% and 3.8%)
In [54]:
# Distribution of churn
plt.figure(figsize=(8, 6))
sns.countplot(data=telecom_data, x='churn')
plt.title('Distribution of Churn')
plt.xlabel('Churn')
plt.ylabel('Count')
plt.show()

# Relationship between churn and estimated_salary
plt.figure(figsize=(10, 6))
sns.boxplot(data=telecom_data, x='churn', y='estimated_salary')
plt.title('Estimated Salary vs Churn')
plt.xlabel('Churn')
plt.ylabel('Estimated Salary')
plt.show()

# Relationship between churn and RFM metrics
plt.figure(figsize=(14, 10))

plt.subplot(3, 1, 1)
sns.boxplot(data=telecom_data, x='churn', y='recency')
plt.title('Recency vs Churn')
plt.xlabel('Churn')
plt.ylabel('Recency')

plt.subplot(3, 1, 2)
sns.boxplot(data=telecom_data, x='churn', y='frequency')
plt.title('Frequency vs Churn')
plt.xlabel('Churn')
plt.ylabel('Frequency')

plt.subplot(3, 1, 3)
sns.boxplot(data=telecom_data, x='churn', y='monetary')
plt.title('Monetary Value vs Churn')
plt.xlabel('Churn')
plt.ylabel('Monetary Value')

plt.tight_layout()
plt.show()
  • These plots summarize the findings of the previous charts

Clustering Algorithm to identify customer segments¶

  • n_clusters: This parameter specifies the number of clusters you want the algorithm to identify in your dataset. In this case, n_clusters=4 means you're instructing KMeans to find four distinct clusters among your data points.
  • n_init: KMeans uses an iterative approach to find the best clustering solution. It initializes the centroids randomly and then iteratively improves them. n_init specifies the number of times the algorithm will run with different centroid seeds. The final result will be the best output in terms of inertia (within-cluster sum of squares) among n_init runs. Setting n_init=10 means the algorithm will run ten times and choose the clustering solution with the lowest inertia.
  • random_state: This parameter sets the seed for the random number generator. When you specify random_state=42, you ensure reproducibility of results. In other words, if you run the algorithm with the same random_state value multiple times, you'll get the same clustering results, which is useful for debugging and sharing code.
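The three parameters above can be seen in action on a tiny synthetic dataset (the points below are illustrative, not drawn from the telecom data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points forming two obvious groups
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

# n_clusters: how many groups to find; n_init: restarts with fresh centroids,
# keeping the run with the lowest inertia; random_state: reproducible seeding
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.labels_, km.inertia_)
```

Because `random_state` is fixed, re-running this cell yields identical labels and inertia.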

Explanation of Clustering Steps¶

  1. Select Features for Clustering: Use the recency, frequency, and monetary features for clustering.
  2. Elbow Method: Determine the optimal number of clusters by plotting the Sum of Squared Errors (SSE) for different numbers of clusters.
  3. K-means Clustering: Apply K-means clustering with the optimal number of clusters (assumed to be 4 in this example).
  4. Add Cluster Labels: Assign cluster labels to each customer and add them to the telecom_data DataFrame.

Let us find the optimal number of clusters using the elbow method and then apply K-means clustering to segment our customers.

In [49]:
# Select features for clustering
rfm_features = telecom_data[['recency', 'frequency', 'monetary']]

# Determine the optimal number of clusters using the elbow method
sse = {}
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=1)
    kmeans.fit(rfm_features)
    sse[k] = kmeans.inertia_

# Plot the elbow curve
plt.figure(figsize=(10, 6))
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.xlabel('Number of Clusters')
plt.ylabel('SSE')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.show()

# Apply K-means with the chosen number of clusters (let's say 4 for this example)
optimal_clusters = 4
kmeans = KMeans(n_clusters=optimal_clusters, random_state=1)
telecom_data['Cluster'] = kmeans.fit_predict(rfm_features)

# Display the first few rows to confirm clustering
print(telecom_data[['customer_id', 'Cluster']].head())
   customer_id  Cluster
0            1        0
1            2        0
2            3        0
3            4        1
4            5        2
Insight:¶
  • The Sum of Squared Errors declines steadily as the number of clusters increases from 1 to 10.
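When the SSE curve declines without an obvious elbow, the silhouette score (imported at the top of this notebook) offers a second opinion on the cluster count. A sketch on small synthetic blobs rather than the full 243k-row dataset, which would be slow:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Three well-separated synthetic blobs (stand-ins for scaled RFM features)
X = np.vstack([rng.normal(c, 0.3, size=(50, 3)) for c in (0.0, 5.0, 10.0)])

# Higher silhouette = tighter, better-separated clusters
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```

The k with the highest score is the preferred cluster count; here it recovers the three planted blobs.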
In [50]:
# Display the first few rows to confirm clustering
print(telecom_data[['customer_id', 'Cluster']])
        customer_id  Cluster
0                 1        0
1                 2        0
2                 3        0
3                 4        1
4                 5        2
...             ...      ...
243548       243549        0
243549       243550        2
243550       243551        1
243551       243552        2
243552       243553        0

[243553 rows x 2 columns]
In [51]:
# Calculate cluster centers
cluster_centers = kmeans.cluster_centers_

# Create a DataFrame with the cluster centers
cluster_centers_df = pd.DataFrame(cluster_centers, columns=rfm_features.columns)
cluster_centers_df['Cluster'] = np.arange(1, len(cluster_centers) + 1)

# Plot cluster centers
plt.figure(figsize=(10, 6))
sns.heatmap(cluster_centers_df.set_index('Cluster'), annot=True, cmap='coolwarm')
plt.title('Cluster Centers')
plt.show()
  • The Recency and Frequency centers vary little across the clusters.
  • Clusters 1 and 4 stand out with markedly higher Monetary centers; the remaining clusters show considerably lower monetary values.
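The choice of k = 4 was assumed above; the silhouette score offers a quantitative sanity check. The sketch below runs it on synthetic blob data rather than the notebook's `rfm_features` (swap in `rfm_features` to apply it here); the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Synthetic stand-in for the RFM feature matrix: four well-separated blobs
features = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 3)) for c in (0, 5, 10, 15)])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=1, n_init=10).fit_predict(features)
    # Silhouette lies in [-1, 1]; higher means tighter, better-separated clusters
    scores[k] = silhouette_score(features, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # → 4 for this synthetic data
```

A high silhouette at k = 4 would corroborate the elbow-method choice; a clearly higher score at another k would suggest revisiting it.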
In [53]:
# Pair plot of RFM features colored by cluster
# (pairplot creates its own figure, so a separate plt.figure() call would
# only leave an empty figure behind)
g = sns.pairplot(telecom_data, vars=['recency', 'frequency', 'monetary'], hue='Cluster', palette='viridis')
g.fig.suptitle('Pair Plot of RFM Features by Cluster', y=1.02)
plt.show()
In [55]:
# Box plots for RFM features by cluster
plt.figure(figsize=(15, 5))

# Recency
plt.subplot(1, 3, 1)
sns.boxplot(x='Cluster', y='recency', data=telecom_data)
plt.title('Recency by Cluster')

# Frequency
plt.subplot(1, 3, 2)
sns.boxplot(x='Cluster', y='frequency', data=telecom_data)
plt.title('Frequency by Cluster')

# Monetary
plt.subplot(1, 3, 3)
sns.boxplot(x='Cluster', y='monetary', data=telecom_data)
plt.title('Monetary Value by Cluster')

plt.tight_layout()
plt.show()
  • Recency and frequency are distributed similarly across all clusters.
  • Monetary value decreases steadily from the first cluster onwards.
In [56]:
# Distribution of Age by Cluster

plt.figure(figsize=(10, 6))
sns.histplot(data=telecom_data, x='age', hue='Cluster', multiple='stack')
plt.title('Age Distribution by Customer Segment')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
  • Cluster 0 (the first cluster) has the highest count across the age distribution; the counts decrease through the subsequent clusters.
In [57]:
#Average Monthly Calls by Cluster

plt.figure(figsize=(10, 6))
sns.barplot(data=telecom_data, x='Cluster', y='monthly_calls', errorbar=None)
plt.title('Average Monthly Calls by Customer Segment')
plt.xlabel('Cluster')
plt.ylabel('Average Monthly Calls')
plt.show()
  • All clusters have approximately the same average monthly call count.
In [63]:
from math import pi  # required for the axis-angle calculation below

# Function to create radar chart
def create_radar_chart(row, title, color):
    labels = row.index
    num_vars = len(labels)
    
    # Compute angle for each axis
    angles = [n / float(num_vars) * 2 * pi for n in range(num_vars)]
    angles += angles[:1]
    
    # Create figure
    fig, ax = plt.subplots(figsize=(6, 6), subplot_kw=dict(polar=True))
    
    # Draw one axe per variable and add labels
    plt.xticks(angles[:-1], labels, color='grey', size=8)
    
    # Draw ylabels
    ax.set_rlabel_position(0)
    plt.yticks([0.2, 0.4, 0.6, 0.8, 1.0], ["0.2", "0.4", "0.6", "0.8", "1.0"], color="grey", size=7)
    plt.ylim(0, 1)
    
    # Plot data
    values = row.tolist()
    values += values[:1]
    ax.plot(angles, values, color=color, linewidth=2, linestyle='solid')
    ax.fill(angles, values, color=color, alpha=0.4)
    
    plt.title(title, size=20, color=color, y=1.1)

# Assuming cluster_centers_df is the DataFrame with cluster centers and a 'Cluster' column
# Scale cluster centers into [0, 1] so they fit the radar chart's fixed axis range
# (StandardScaler output is not bounded to [0, 1] and would fall outside the chart)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
cluster_centers_normalized = scaler.fit_transform(cluster_centers_df.drop('Cluster', axis=1))
cluster_centers_normalized_df = pd.DataFrame(cluster_centers_normalized, columns=rfm_features.columns)

# Plot radar chart for each cluster
for i in range(len(cluster_centers_df)):
    row = cluster_centers_normalized_df.iloc[i]
    create_radar_chart(row, f'Cluster {i+1}', color=sns.color_palette('viridis', len(cluster_centers_df))[i])

plt.show()

Targeted Marketing Segmentation and Personalized Messaging¶

Let's use the estimated salary of customers to devise a marketing strategy through segmentation.

In [64]:
print(f'The minimum salary is: {telecom_data["estimated_salary"].min()} and the maximum salary is: {telecom_data["estimated_salary"].max()}')
The minimum salary is: 20000 and the maximum salary is: 149999
In [65]:
# By statistics
min_salary = telecom_data['estimated_salary'].min()
max_salary = telecom_data['estimated_salary'].max()
salary_range = max_salary - min_salary
segment_size = salary_range / 3

low_income = min_salary
medium_income = min_salary + segment_size
high_income = min_salary + 2 * segment_size

print(f"Low income earners: {low_income} to {medium_income}")
print(f"Medium income earners: {medium_income} to {high_income}")
print(f"High income earners: {high_income} to {max_salary}")
Low income earners: 20000 to 63333.0
Medium income earners: 63333.0 to 106666.0
High income earners: 106666.0 to 149999
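The manual equal-width split above can be reproduced in one step with `pd.cut`. This is a sketch on a toy salary column (the `toy` frame and `income_band` column name are illustrative); in the notebook, `telecom_data['estimated_salary']` would be passed instead.

```python
import pandas as pd

toy = pd.DataFrame({'estimated_salary': [20000, 45000, 63000, 70000, 105000, 149999]})

# pd.cut splits the observed salary range into three equal-width bins,
# matching the manual (max - min) / 3 calculation above
toy['income_band'] = pd.cut(
    toy['estimated_salary'],
    bins=3,
    labels=['Low income', 'Medium income', 'High income']
)
print(toy['income_band'].tolist())
```

`pd.cut` also returns the bin edges via `retbins=True`, which is convenient for reporting the segment boundaries.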
In [60]:
'''"estimated_salary" is the column representing estimated salary
Let's make High-Income salary > 80000
Mid-Income 50000 <= salary <= 80000
Low-Income salary < 50000
'''

# Define income segments based on estimated salary
def segment_by_salary(salary):
    if salary > 80000:
        return 'High-Income'
    elif 50000 <= salary <= 80000:
        return 'Mid-Income'
    else:
        return 'Low-Income'

# Apply segmentation function to create a new column 'income_segment'
telecom_data['income_segment'] = telecom_data['estimated_salary'].apply(segment_by_salary)

# Create personalized messaging based on income segments
def create_personalized_message(segment):
    if segment == 'High-Income':
        return "Unlock premium services tailored for your lifestyle!"
    elif segment == 'Mid-Income':
        return "Discover value-packed bundles designed for your needs!"
    else:
        return "Affordable plans with great features await you!"

# Apply personalized messaging function to create a new column 'marketing_message'
telecom_data['marketing_message'] = telecom_data['income_segment'].apply(create_personalized_message)

# Display a sample of segmented data with personalized messages
print(telecom_data[['customer_id', 'estimated_salary', 'income_segment', 'marketing_message']].head())
   customer_id  estimated_salary income_segment  \
0            1            124962    High-Income   
1            2            130556    High-Income   
2            3            148828    High-Income   
3            4             38722     Low-Income   
4            5             55098     Mid-Income   

                                   marketing_message  
0  Unlock premium services tailored for your life...  
1  Unlock premium services tailored for your life...  
2  Unlock premium services tailored for your life...  
3    Affordable plans with great features await you!  
4  Discover value-packed bundles designed for you...  

Lifetime Value (LTV) Calculation¶

In [70]:
# Connect to SQLite database
conn = sqlite3.connect('telecom_data.db')

# dataframe is already loaded into `telecom_data`
telecom_data.to_sql('telecom_data', conn, if_exists='replace', index=False)

# Define the SQL query
query = """
WITH AveragePurchaseValue AS (
    SELECT 
        customer_id,
        AVG(estimated_salary) AS avg_purchase_value
    FROM telecom_data
    GROUP BY customer_id
),
PurchaseFrequency AS (
    SELECT 
        customer_id,
        (SUM(calls_made) + SUM(sms_sent)+ SUM(data_used)) AS purchase_frequency
    FROM telecom_data
    GROUP BY customer_id
),
CustomerLifespan AS (
    SELECT 
        customer_id,
        (julianday('now') - julianday(date_of_registration)) AS customer_lifespan
    FROM telecom_data
),
LTV_Calculation AS (
    SELECT 
        a.customer_id,
        a.avg_purchase_value,
        b.purchase_frequency,
        c.customer_lifespan,
        (a.avg_purchase_value * b.purchase_frequency * c.customer_lifespan / 365.0) AS LTV
        
    FROM 
        AveragePurchaseValue a
    JOIN 
        PurchaseFrequency b ON a.customer_id = b.customer_id
    JOIN 
        CustomerLifespan c ON a.customer_id = c.customer_id
)
SELECT 
    customer_id,
    LTV
FROM 
    LTV_Calculation
ORDER BY 
    customer_id ASC;
"""

# Execute the query and fetch results
ltv_df = pd.read_sql_query(query, conn)

# Display the LTV DataFrame
print(ltv_df.head())

# Close the connection
conn.close()
   customer_id           LTV
0            1 -1.500345e+08
1            2  3.500385e+09
2            3  1.747473e+08
3            4  1.620697e+09
4            5  3.614089e+08
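The same LTV formula can be computed directly in pandas, which is useful for cross-checking the SQL result. The sketch below uses a toy frame (the `toy` frame and its `lifespan_days` column are illustrative stand-ins; the dataset has one row per customer, so the per-customer aggregation reduces to the row values).

```python
import pandas as pd

toy = pd.DataFrame({
    'customer_id': [1, 2],
    'estimated_salary': [100.0, 200.0],
    'calls_made': [10, 5],
    'sms_sent': [5, 5],
    'data_used': [5, 0],
    'lifespan_days': [365.0, 730.0],  # stand-in for julianday('now') - registration date
})

# LTV = average purchase value * purchase frequency * lifespan in years
toy['purchase_frequency'] = toy['calls_made'] + toy['sms_sent'] + toy['data_used']
toy['LTV'] = toy['estimated_salary'] * toy['purchase_frequency'] * toy['lifespan_days'] / 365.0
print(toy[['customer_id', 'LTV']])
```

Note that a negative `data_used` value makes `purchase_frequency`, and hence LTV, negative, which explains the negative LTV seen for customer 1 in the SQL output.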
In [71]:
telecom_data.head()
Out[71]:
customer_id telecom_partner gender age state city pincode date_of_registration num_dependents estimated_salary ... tenure_days monthly_calls monthly_sms monthly_data R F M RFM_Segment RFM_Score Cluster
0 1 Reliance Jio F 25 Karnataka Kolkata 755597 2020-01-01 4 124962 ... 1611 0.819367 0.837989 -6.722533 4 1 4 414 9 0
1 2 Reliance Jio F 55 Mizoram Mumbai 125926 2020-01-01 2 130556 ... 1611 1.154562 0.726257 111.229050 4 3 4 434 11 0
2 3 Vodafone F 57 Arunachal Pradesh Delhi 423976 2020-01-01 0 148828 ... 1611 0.912477 0.446927 3.594041 4 1 4 414 9 0
3 4 BSNL M 46 Tamil Nadu Kolkata 522841 2020-01-01 1 38722 ... 1611 1.489758 0.465549 174.618250 4 4 1 441 9 1
4 5 BSNL F 26 Tripura Delhi 740247 2020-01-01 2 55098 ... 1611 1.452514 0.279330 25.940410 4 1 2 412 7 2

5 rows × 27 columns

In [72]:
ltv_df
Out[72]:
customer_id LTV
0 1 -1.500345e+08
1 2 3.500385e+09
2 3 1.747473e+08
3 4 1.620697e+09
4 5 3.614089e+08
... ... ...
243548 243549 5.821595e+08
243549 243550 6.785698e+08
243550 243551 3.634111e+08
243551 243552 1.103598e+08
243552 243553 5.351703e+08

243553 rows × 2 columns

In [73]:
# Distribution of LTV

plt.figure(figsize=(10, 6))
sns.histplot(data=ltv_df, x='LTV', kde=True)
plt.title('Lifetime Value Distribution')
plt.xlabel('Lifetime Value')
plt.ylabel('Frequency')
plt.show()
In [74]:
telecom_data['LTV'] = ltv_df['LTV']
In [75]:
telecom_data[['customer_id', 'LTV']].head()
Out[75]:
customer_id LTV
0 1 -1.500345e+08
1 2 3.500385e+09
2 3 1.747473e+08
3 4 1.620697e+09
4 5 3.614089e+08
In [76]:
ltv_df[['customer_id','LTV']].head()
Out[76]:
customer_id LTV
0 1 -1.500345e+08
1 2 3.500385e+09
2 3 1.747473e+08
3 4 1.620697e+09
4 5 3.614089e+08

Correlation Analysis¶

In [77]:
# Correlation matrix
#numeric_data = telecom_data.select_dtypes(include=['number'])
numeric_data = telecom_data.select_dtypes(include=['int64', 'float64'])
correlation_matrix = numeric_data.corr()
# Heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='rainbow')
plt.title('Correlation Matrix')
plt.show()

Model Development¶

1. Churn Prediction Model¶

In [82]:
# Identify categorical columns
categorical_cols = telecom_data.select_dtypes(include=['object']).columns

# Identify numerical columns
numerical_cols = telecom_data.select_dtypes(include=['int64', 'float64']).columns.drop('churn')

# Preprocess the data
# Define the preprocessing for numerical features
numerical_transformer = SimpleImputer(strategy='median')

# Define the preprocessing for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define the model
model = GradientBoostingClassifier(random_state=42)

# Create and evaluate the pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', model)
])

# Prepare the features and target variable
X = telecom_data.drop(columns=['churn'])
y = telecom_data['churn']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("Confusion Matrix:\n", conf_matrix)
Accuracy: 0.7991418776046478
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
Confusion Matrix:
 [[38927     1]
 [ 9783     0]]
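The zero precision and recall show the classifier predicts the majority (non-churn) class for every customer, a classic symptom of class imbalance (roughly 80/20 here). One common mitigation is to upweight the minority class during training; `GradientBoostingClassifier.fit` accepts per-row `sample_weight`. The sketch below demonstrates this on synthetic data, not the notebook's features.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(42)
# Imbalanced synthetic target (~80% class 0) with one weakly informative feature
y = (rng.random(2000) < 0.2).astype(int)
X = y.reshape(-1, 1) + rng.normal(scale=1.0, size=(2000, 1))

# 'balanced' weights each class inversely to its frequency
weights = compute_sample_weight(class_weight='balanced', y=y)

model = GradientBoostingClassifier(random_state=42)
model.fit(X, y, sample_weight=weights)
pred = model.predict(X)

# With balancing, the model predicts some positives instead of all zeros
print(pred.sum() > 0)
```

Resampling techniques (e.g. SMOTE from `imbalanced-learn`) or threshold tuning on predicted probabilities are alternative remedies.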
In [87]:
# Identify categorical columns
categorical_cols = X_train.select_dtypes(include=['object']).columns

# Preprocess the categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

# Define the RandomForestClassifier model with preprocessor
rf_classifier = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train the model
rf_classifier.fit(X_train, y_train)

# Make predictions
y_pred = rf_classifier.predict(X_test)

2. LTV prediction Model¶

We'll start by choosing a model for LTV prediction. Considering the nature of the problem (predicting a continuous value), regression models such as Linear Regression, Decision Trees, Gradient Boosting, or even Neural Networks can be suitable choices.

Let's choose Gradient Boosting Regressor as it often performs well with complex datasets and can capture nonlinear relationships.

In [88]:
# Prepare the features and target variable
X = telecom_data[['recency', 'frequency', 'monetary']]
y = telecom_data['LTV']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Gradient Boosting Regressor model
gb_regressor = GradientBoostingRegressor()

# Train the model
gb_regressor.fit(X_train, y_train)
Out[88]:
GradientBoostingRegressor()

Model Evaluation¶

Evaluate the Model¶

We'll evaluate the trained model using metrics such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).

Interpretation¶

  • RMSE: This metric gives us an idea of how much our predictions deviate, on average, from the actual LTV values. Lower values indicate better model performance.
  • MAE: This metric represents the average absolute difference between the predicted and actual LTV values. Similar to RMSE, lower values are desired.
In [90]:
# Make predictions on the test set
y_pred = gb_regressor.predict(X_test)

# Calculate RMSE and MAE
rmse = mean_squared_error(y_test, y_pred, squared=False)
mae = mean_absolute_error(y_test, y_pred)

print("Root Mean Squared Error (RMSE):", rmse)
print("Mean Absolute Error (MAE):", mae)
Root Mean Squared Error (RMSE): 64992491.76284695
Mean Absolute Error (MAE): 50102285.34452021

Predictive Model Insights¶

Model Evaluation Metrics¶

  • Root Mean Squared Error (RMSE): ≈ 64,992,492
  • Mean Absolute Error (MAE): ≈ 50,102,285

Interpretation¶

The RMSE and MAE values provide an understanding of the predictive accuracy of our model:

  1. Model Accuracy:

    • The RMSE of roughly 64.99 million indicates that, on average, the model's predictions deviate by about 65 million units from the actual LTV values.
    • The MAE of roughly 50.1 million suggests that, on average, the predictions are off by about 50 million units from the actual LTV values.
    • Both RMSE and MAE are high in absolute terms, indicating substantial prediction errors.
  2. Error Magnitude:

    • The higher RMSE compared to MAE suggests the presence of some large errors (outliers) in our predictions. RMSE penalizes larger errors more heavily than MAE.
  3. Model Improvement:

    • To improve the model's accuracy, consider:
      • Feature Engineering: Add or transform features to better capture the underlying patterns.
      • Hyperparameter Tuning: Adjust the hyperparameters of the Gradient Boosting Regressor.
      • Model Selection: Try different models (e.g., Random Forest, XGBoost) to find a better fit.
      • Data Quality: Ensure that the data is clean and representative of the problem space.
  4. Business Implications:

    • Given the high prediction errors, decisions based solely on these predictions should be made cautiously.
    • Misidentifying high LTV customers due to prediction errors can affect marketing strategies and customer relationship management.
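The point about RMSE penalizing large errors more heavily than MAE can be illustrated numerically: a single outlier error inflates RMSE much more than MAE. A toy sketch (values are illustrative):

```python
import numpy as np

actual = np.array([100.0, 100.0, 100.0, 100.0, 100.0])
# Four small errors and one large outlier error of 50
predicted = np.array([101.0, 99.0, 102.0, 98.0, 150.0])

errors = predicted - actual
mae = np.mean(np.abs(errors))          # (1 + 1 + 2 + 2 + 50) / 5 = 11.2
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt(2510 / 5) ≈ 22.4

print(mae, rmse)  # RMSE is pulled up far more by the outlier than MAE
```

Because squaring magnifies the outlier's contribution, a large RMSE/MAE gap, as seen in our LTV model, signals that a few predictions are badly off rather than all predictions being uniformly poor.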

Next Steps¶

  1. Diagnostic Plots:

    • Actual vs. Predicted Plot: To visually inspect how well the model predictions match the actual values.
    • Residual Plot: To identify any patterns in the errors which might indicate issues like non-linearity or heteroscedasticity.
  2. Cross-Validation:

    • Perform cross-validation to get a more robust estimate of model performance and ensure the model is not overfitting.
  3. Feature Importance Analysis:

    • Analyze feature importance to understand which features are most influential in predicting LTV and refine the feature set if necessary.
  4. Model Tuning and Selection:

    • Experiment with different models and hyperparameters to improve the prediction accuracy.

By addressing these insights and taking steps to refine the model, we can enhance the reliability of our LTV predictions and make more informed business decisions.
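The cross-validation step from the list above can be run with `cross_val_score`. This sketch uses synthetic data (the `X` and `y` here are illustrative stand-ins); in the notebook, the RFM features and the LTV column would be passed instead.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # stand-in for recency / frequency / monetary
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=500)

# 5-fold CV with negated RMSE (scikit-learn maximizes scores, so errors are negated)
scores = cross_val_score(
    GradientBoostingRegressor(random_state=0), X, y,
    cv=5, scoring='neg_root_mean_squared_error'
)
rmse_per_fold = -scores
print(rmse_per_fold.mean())
```

Consistent per-fold RMSE values suggest the single train/test split above is representative; a wide spread would indicate the model is sensitive to the particular split.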

In [91]:
# Plot Actual vs. Predicted Values
'''
Visualizing the actual vs. predicted values helps us understand how well our model is performing.'''

plt.scatter(y_test, y_pred)
plt.xlabel('Actual LTV')
plt.ylabel('Predicted LTV')
plt.title('Actual vs Predicted LTV')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red')  # Line y=x
plt.show()
In [92]:
# Plot Residuals
'''Plotting the residuals (differences between actual and predicted values) 
to help us identify patterns that might indicate model issues.'''

residuals = y_test - y_pred
plt.hist(residuals, bins=30)
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.title('Residuals Distribution')
plt.show()
In [93]:
#Feature Importance
#Understanding which features are most important to our model can provide insights into the model’s decisions.

feature_importances = pd.Series(gb_regressor.feature_importances_, index=X.columns)
feature_importances = feature_importances.sort_values(ascending=False)

plt.bar(feature_importances.index, feature_importances)
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Feature Importances')
plt.show()
In [96]:
# Let's add the necessary columns to scaler_df for testing.
scaler_df[['recency', 'frequency', 'monetary']] = telecom_data[['recency', 'frequency', 'monetary']]
In [97]:
scaler_df[['recency', 'frequency', 'monetary']]
Out[97]:
recency frequency monetary
0 1611 -272 124962
1 1611 6074 130556
2 1611 266 148828
3 1611 9482 38722
4 1611 1486 55098
... ... ... ...
243548 393 4139 130580
243549 393 7646 82393
243550 393 6577 51298
243551 393 1220 83981
243552 392 3452 144297

243553 rows × 3 columns

In [98]:
# Lets test the model on scaler_df 
new_data_predictions = gb_regressor.predict(scaler_df[['recency', 'frequency', 'monetary']])
print(new_data_predictions)
[1.06318661e+08 3.44108148e+09 2.98132766e+08 ... 3.58163306e+08
 5.01801271e+07 7.05467931e+08]
In [100]:
# Identify categorical and numerical columns
categorical_cols = telecom_data.select_dtypes(include=['object']).columns
numerical_cols = telecom_data.select_dtypes(include=['int64', 'float64']).columns.drop('churn')

# Preprocess the data
numerical_transformer = SimpleImputer(strategy='median')
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define the model
gb_classifier = GradientBoostingClassifier(random_state=42)

# Create and evaluate the pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', gb_classifier)
])

# Prepare the features and target variable
X = telecom_data.drop(columns=['churn'])
y = telecom_data['churn']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("Confusion Matrix:\n", conf_matrix)
Accuracy: 0.7991418776046478
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
Confusion Matrix:
 [[38927     1]
 [ 9783     0]]
In [ ]:
# Ensure telecom_data is already loaded and preprocessed
# Define categorical and numerical columns
categorical_cols = telecom_data.select_dtypes(include=['object']).columns
numerical_cols = telecom_data.select_dtypes(include=['int64', 'float64']).columns.drop('churn')

# Define transformers
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Define preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define the model
gb_classifier = GradientBoostingClassifier(random_state=42)

# Create pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', gb_classifier)
])

# Prepare features and target variable
X = telecom_data.drop(columns=['churn'])
y = telecom_data['churn']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
clf.fit(X_train, y_train)

# The pipeline was fitted on the full feature set (X.columns), so the prediction
# input must contain those same columns, not just the three RFM features
missing_cols = [col for col in X.columns if col not in scaler_df.columns]
if missing_cols:
    raise ValueError(f"scaler_df is missing columns required by the pipeline: {missing_cols}")

# Make predictions on the new data
new_data_predictions = clf.predict(scaler_df[X.columns])

# If you want the probabilities
new_data_probabilities = clf.predict_proba(scaler_df[X.columns])[:, 1]

# Print the predictions
print("Predicted Classes:", new_data_predictions)
print("Predicted Probabilities:", new_data_probabilities)